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Abstract 

Current psychiatric research is in crisis. In this review I will describe the causes of this cri- 
sis and highlight recent efforts to overcome current challenges. One particularly promising ap- 
proach is the emerging field of computational psychiatry. By using methods and insights from 
computational cognitive neuroscience, computational psychiatry might enable us to move from 
a symptom-based description of mental illness to descriptors based on objective computational 
multidimensional functional variables. To exemplify this I will survey recent efforts towards this 
goal. I will then describe a set of methods that together form a toolbox of cognitive models 
to aid this research program. At the core of this toolbox are sequential sampling models which 
have been used to explain diverse cognitive neuroscience phenomena but have so far seen little 
adoption in psychiatric research. I will then describe how these models can be fitted to subject 
data and highlight how hierarchical Bayesian estimation provides a rich framework with many 
desirable properties and benefits compared to traditional optimization-based approaches. Finally, 
non-parametric Bayesian methods provide general solutions to the problem of classifying mental 
illness within this framework. 
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1 Part I: Motivation 



Imagine going to a doctor because of chest-pain that has been bothering you for a couple of weeks. 
The doctor would sit down with you, listen carefully to your description of symptoms and prescribe 
medication to lower blood pressure in case you have a heart condition. After a couple of weeks your 
pain has not subsided. The doctor now prescribes medication against reflux which finally seems to help. 
In this scenario not a single medical analysis (e.g. EKG, blood work or a gastroscopy) was performed 
and medication with potentially severe side-effects prescribed on a trial- and-error basis. While highly 
unlikely to occur if you walked into a primary care unit with these symptoms today, this scenario is 
actually a realistic description if you had mental problems and had seen a psychiatrist instead. 



There are several reasons for this discrepancy in sophistication between psychiatry and other fields 
of medicine. The main cause is that by definition, mental illness affects the brain - the most complex 
biological system yet encountered. Compared to the level of scientific understanding achieved on other 
organs of the human body such as the heart, our understanding of the normally functioning brain is 
still, arguably, in its infancy. 



Despite this complexity concerted efforts in the brain sciences have lead to an explosion of knowledge 
and understanding about the healthy and diseased brain in the last decades. The discovery of highly 
effective psychoactive drugs in the 50s and 60s raised expectations that psychiatry would progress in 
a similar fashion. Unfortunately, in retrospect it appears that these discoveries were serendipitous in 



Insel et al. 


2010 


Hyman 


2012 



progress also caused many major pharmaceuticals companies like AstraZeneca and GlaxoSmithKline 



to withdraw from psychiatric drug development and close large research centers (Nutt and Goodwin 



2011 


Cressey 


20111. In sum, psychiatry is a field in crisis ( 


Poland et al. 


1994 


Insel et al. 


2010 



Hyman 


2012 


Sahakian et al. 


2010 



is a lack of measurable quantitative descriptors of mental illness. This lack results from an explanatory 



gap of how basic neurobiological aberrations result in complex disorders of the mind ( Montague et al. 



2011 Hyman 2012). 



4 



In part II I will review current challenges in psychiatry and recent efforts to overcome them. Several 
examples from the domain of decision making show the promise of moving away from symptom-based 
description of mental illness and instead formulate objective, quantifiable computational biomarkers 
as a basis for further psychiatric research. Part III introduces a computational cognitive toolbox that 
is suited to construct these computational biomarkers. Sequential sampling models serve as a case 
study for how computational models, when fit to behavior, have successfully been used to identify and 
quantify latent neurocognitive processes in healthy humans. Baycsian methods provide a resourceful 
framework to fit these models to behavior and establish individualized descriptors of neurocognitive 
function. After establishing the validity of these models to provide neurocognitive descriptors of indi- 
viduals, I will review how clustering techniques can be used to construct a map of individual differences 
based on these neurocognitive descriptors. In sum, the objective of this review is to outline a research 
program to map the domain of neuropsychiatric disease. 



In order to maintain a clear focus on this objective there are certain relevant issues not addressed 
here-within. While of critical importance to psychiatric patients, I will not discuss the clinical, pharma- 
ceutical, environmental, social or developmental aspects of mental illness or rehabilitation programs. 
Moreover, I will treat mental illness as a disease of the brain with a focus on dysfunctional neurocir- 



cuitry (Insel 20101. 



1.1 Current challenges in psychiatry 

While the current crisis in psychiatry has complex causes that are deeply rooted in existing classifica- 



tion systems, one of the core problems is what has been identified by Montague et al. (2011) as the 



explanatory gap. This gap refers to our lack of understanding of the causal processes linking genes, 
molecular, cellular, neurocircuitry and cognition to psychiatric symptoms. This explanatory gap, cou- 



pled with the "(almost) unreasonable effectiveness of psychotropic medication" (Montague et al. , 2011 1 
gave many a false premise to expect progress without understanding. 

This explanatory gap suggests a new approach to research of mental disorders which aims to link 
cognitive and pure neuroscience to clinical symptoms without the restrictions of prior classification 



schemes (Poland and Von Eckardt 2013 Cuthbert and Insel 2010 Robbins et al. 2012). I will next 



specify problems current classification systems introduce followed by recent efforts to address them. 
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1.1.1 Diagnostic and Statistical Manual of Mental Disorders 



For decades the Diagnostic and Statistical Manual of Mental Disorders (DSM) has been the basis 
of clinical diagnosis, treatment and research of mental illness. At its core, the DSM defines distinct 
disease categories like schizophrenia (SZ) and depression. These categories are mainly derived from 



translating subjective experience to objective symptomatology (Nordgaard et al. 2012) assuming un 



specified biological, psychological, or behavioral dysfunctions (Poland et al. 1994). 



While of certain value to clinicians the DSM is specifically designed to also serve as a classifica- 
tion system for scientific research with the goal of more easily translating results directly into clinical 
practice. While this translational research goal is commendable, decisions regarding systematic clas- 



sification are more often based on perceptions of clinical utility rather than scientific merit (Poland 



and Von Eckardt, 2013). As a consequence, DSM-based research programs failed to deliver consistent. 



replicable and specific results (Kendell and Jablensky 2003 Andreasen 2007 Regier and Narrow 



2009 Kendleretal. 2009 Cuthbert and Insel 2010 Hyman 2010). 



Heterogeneity and Comorbidity One major problem of contemporary psychiatry classification 
is the heterogeneus symptomatology of patients receiving identical diagnoses. One striking example 



of this is SZ where one must show at least 2 out of 5 symptoms to receive a diagnosis (Heinrichs 



2001). It is thus possible to have patients with completely different symptomatology being diagnosed 



as schizophrenic. 

Comorbidity is defined as the co-occurrence of multiple diseases in one individual. Importantly, 
we must differentiate between two relevant types of comorbidity: (i) True comorbidity is a result of 
independent disorders co-occurring; (ii) artificial comorbidity on the other hand is a result of separately 



classifying disorders that have a similar pathogenic cascade. It is now widely documented (Markon 



2010 Krueger and Markon 2006 ) that "comorbidity between mental disorders is the rule rather than 



the exception, invading nearly all canonical diagnostic boundaries." ( Buckholtz and Meyer-Lindenberg 



2012). The authors further note that "It is important to understand that comorbidity in psychiatry 



does not imply the presence of multiple diseases or dysfunctions but rather reflects our current inability 
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[to formulate] a single diagnosis to account for all symptoms." Together, these issues belie the as- 
sumption that DSM-disorders represent distinct and independent categories with a unique pathological 



cascade (Krueger and Markon 2006 Hyman 2010 Cuthbert and Insel 2010 Andreasen 2007). 



2 Part II: Potential Solutions 

As outlined above, the short-comings of the current DSM manual are well documented and a need 
for improvement has been recognized. In the following I will outline current efforts to address these 
challenges. 



2.1 Research Domain Criteria Project 

The Research Domain Criteria Project (RDoC) is an initiative by the National Institute for Mental 



Health (NIMH) (Insel et al. 2010). RDoC improves on previous research efforts based on the DSM 
in the following ways. First, as the name implies it is conceptualized as a research framework only 
and is thus clearly separated from clinical practice. Second, RDoC is completely agnostic about DSM 
categories. Instead of a top-down approach which aims at identifying neural correlates of psychiatric 
disease, RDoC suggests a bottom-up approach that builds on the current understanding of neurobio- 
logical underpinnings of different cognitive processes and link those to clinical phenomena. Third, the 
RDoC research program integrates different levels of analysis like imaging, behavior and self-reports. 



At its core, RDoC is structured into a matrix with columns representing different "units of anal- 
ysis" and rows for research domains. The units of analysis include genes, molecules, cells, circuits, 
physiology, behavior, and self-reports. Research domains are clustered into negative and positive va- 
lence systems, cognitive systems, systems for social processes and arousal/regulatory systems. Each 
of these domains is further subdivided into distinct processes; for example, cognitive systems include 
attention, perception, working memory, declarative memory, language behavior and executive control. 



Despite clear improvements over previous DSM-based research programs, the RDoC initiative cur- 



rently lacks consideration of computational descriptors (Poland and Von Eckardt, 2013). As I will 



iSee 



Borsboom et al. 



(2011 1 for an alternative explanation based on causal relationships between symptoms. 
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outline below, computational methods show great promise to help link different levels of analysis, 
elucidate clinical symptoms and identify sub-groups of healthy and patient populations. 



2.2 Neurocognitive phenotyping 



In a recent review article, Robbins et al. (20121 suggest the use of neurocognitive endophenotypes to 



study psychiatric disease: "Neurocognitive endophenotypes would furnish more quantitative measures 
of deficits by avoiding the exclusive use of clinical rating scales, and thereby provide more accurate 
descriptions of phenotypes for psychiatric genetics or for assessing the efficacy of novel treatments. 
The use of such measures would likely also facilitate and improve the use of informative animal models 
in psychiatry by focusing on cognitive and neural processes that can often be investigated in parallel 
across species. Defining such endophenotypes might cut across traditional psychiatric classification, 
and hence begin to explain the puzzle of apparent comorbidities." 



Of particular interest are three studies that use such neurocognitive endophenotypes by constructing 
dimensional functional profiles (MFPs) from summary statistics of a battery of various neuropsycho- 



logical tasks to identify subtypes of ADHD ( Durston et al. 2008 Sonuga-Barke 2005 Fair et al. 2012 ) 



Durston et al. ( 2008 ) argues that there are distinct pathogenic cascades along which abnormalities 



in at least three different brain circuits can lead to similar symptomatology. Specifically, abnormalities 
in dorsal frontostriatal, orbito-frontostriatal, or fronto-cerebellar circuits can lead to impairments of 
cognitive control, reward processing and timing, respectively. Core deficits in one or multiple of these 
brain networks can thus result in a clinical diagnosis of ADHD (see figure [T]) and provides a com- 
pelling explanation for the heterogeneity of the ADHD patient population. Preliminary evidence for 



this hypothesis is provided by Sonuga-Barke (2005) who used principal component analysis (PCA) on 



multi-dimensional functional profiles (based on a neuropsychological task battery) of ADHD patients 
and identified 3 distinct sub-types co-varying on timing, cognitive control, and reward. 



A similar approach of identifying clusters in the ADHD population using MFPs was taken by |Fair| 



et al. (20121. While similar in spirit to Sonuga-Barke (2005) and Durston et al. (2008), Fair et al. 



(20121 do not only look at differences in the patient population but in both, healthy controls (HC) 
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Figure 1: Abnormalities in distinct brain areas (bottom level) can lead to different cognitive impair- 
ments (2nd level) and result in an ADHD diagnosis (top level) . Figure reproduced from Durston et al. 
p08|. 



and ADHD patients. The clustering is achieved by the application of graph theory. Interestingly, 
the authors find that HC and ADHD is not the predominant dimension along which clusters form. 
Instead, the authors uncover different functional profiles that apply to both, HC and ADHD patients. 
Critically, a classifier trained to predict HC and ADHD subjects inside of individual profiles achieved 
better performance than a classifier trained on the aggregated data. In addition, the abnormalities in 
cognitive functions of ADHD patients were different across different clusters (e.g. one cluster might 
show differences in response inhibition while another one shows differences in RT variability; see figure 
2j3). In other words, this implies that the overall population clusters into different cognitive profiles. 
ADHD affects individuals differently based on which cognitive profile they exhibit. Importantly, this 
study suggests that the source of heterogeneity may not only be distinct pathogenic cascades being 
labeled as the same disorder but may actually be a result of the inherent heterogeneity present in the 
overall population - healthy and diseased. 



The above mentioned studies all exemplify the danger of lumping subjects at the level of disease and 
treating them as one homogeneous category with a single, identifiable pathological cascade. Instead, 
these studies use MFFs to find an alternative characterization of subjects independent of their DSM 
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Figure 2: Profile differences between ADHD and healthy controls (TDC) on based on functional 
descriptors. Lower left corner describes which each point along the x-axis represents. Upper left 
corner represents overall profile-differences. Right side shows differences between ADHD and HC 
functional profiles inside various clusters (i.e. profiles). As can be seen, group-differences vary over 
different profiles. Reproduced from Fair et al. (2012). 



classification that is (i) quantitatively measurable and (ii) a closer approximation to the underlying 



neurocircuitry (Robbins et al. 2012). 

While a clear improvement on previous research efforts that use the DSM diagnosis as the sole de- 
scriptor this approach still has problems. First, although there is less reliance on DSM categories, these 
studies still use the diagnostic label for recruiting subjects. It could be imagined, for example, that 
patients with similar impulse control disorders like OCD or Tourette's have abnormalities in similar 
brain circuits; thus, if only OCD patients are recruited a critical part of the picture might be missed. 
Second, the cognitive task battery only covers certain aspects of cognitive function. Other tasks that 
for example measure working memory or reinforcement learning would be a useful addition. Finally, 
performance on each individual task is assessed by an aggregate performance score. Recent behavioral 
and neuropsychological findings, however, suggest that executive control in a single task may instead 



be more accurately characterized as a collection of related but separable abilities (Baddeley 1966 



Collette et al. 2005), a pattern referred to as the unity and diversity of executive functions (Duncan 



et al. 


1997 


Miyake et al. 


2000 



tion of various neural networks and cognitive processes (see e.g. Collins and Frank 2012). This task 
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impurity problem (Burgess 
based solely on MFPs. 



1997 



Phillips 



19971 complicates identification of separate brain circuits 



In sum, while cognitive phenotypes provide a useful framework for measuring brain function there 
is still ambiguity when using behavioral scores that present an aggregate measure of various brain 



networks. This issue is also discussed by Buckholtz and Meyer-Lindenberg (2012) in relation to comor- 
bidity: "The fact that a brain circuit can be involved in multiple cognitive domains helps explain why 
diverse psychiatric disorders can exhibit common deficits and symptoms (comorbidity)." This "com- 
mon symptom, common circuit" model of psychopathology is illustrated by figure [Sj Disentangling 
these transdiagnostic patterns of psychiatric symptoms thus requires identification and measurement 
of underlying brain circuits. While the authors propose the use of functional imaging studies and ge- 
netic analysis I will discuss how computational modeling can contribute to disambiguate the multiple 
pathways leading to behavioral features. 



2.3 Computational psychiatry 

How have other brain sciences dealt with one-to-many mapping problem trying to dissociate behavior 
on a cognitive task with brain circuits? Computational models at different levels of abstraction have 
had tremendous impact on the field of cognitive neuroscience. The aim is to construct a model based 
on integrated evidence from neuroscience and psychology to explain neural activity as well as cognitive 
behavior. While more detailed biologically inspired models such as biophysical and neural network 
models are generally more constrained by neurobiology they often have many parameters which make 
it very difficult to fit them directly to human behavior. More abstract, process based models on the 
other hand often have fewer parameters that allow them to be fit directly to data at the cost of being 
less detailed about the neurobiology. Critically, all of these models allow for increased specificity in the 
identification of different neuronal and psychological processes that are often lumped together when 
analyzing task behavior based on summary statistics. 



Using computational models to infer dysfunctional latent processes in the brain is the premise of the 
newly emerging field of computational psychiatry. In their groundbreaking review, [Montague et aL\ 



(20111 define the goal for computational psychiatry of "extracting computational principles around 
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Figure 3: Pathonegic cascade from Brain Circuit (BC) over multiple related cognitive processes (Cl- 
C3) to symptoms (Sa-Si). "Some of these symptoms will constitute diagnostic criteria for categorical 
disorder A but not disorder B (yellow shading), and some symptoms will be relatively selective for 
disorder B but not disorder A (red shading). However, the plurality of symptoms will overlap the 
two diagnostic categories (transdiagnostic symptoms, orange shading). This highlights the idea that 
connectivity circuits convey cognitive and symptom domain-specific, but disorder-general, genetic risk 
for mental illness." Reproduced from Buckholtz and Meyer-Lindenberg (2012). 
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Figure 4: Overview of computational psychiatry. Different computational tools like computational 



modeling contribute to insights in mental health research. Reproduced from Montague et al. (2011 ) 



which human cognition and its supporting biological apparatus is organized. Achieving this goal will 
require new types of phenotyping approaches, in which computational parameters are estimated (neu- 
rally and behaviorally) from human subjects and used to inform the models. This type of large-scale 
computational phenotyping of human behavior does not yet exist." Also, see figure [4] for a visual 
depiction. 



Based on this premise, Maia and Frank (2011 1 identify computational models as a "valuable tool in 
taming [the complex pathological cascades of mental illness] as they foster a mechanistic understanding 
that can span multiple levels of analysis and can explain how changes to one component of the system 
(for example, increases in striatal D2 receptor density) can produce systems-level changes that translate 
to changes in behavior" . Moreover, the authors define three concrete strategies for how computational 
models can be used to study brain dysfunction (see also figure [S]) given a model of normal function: 

• Deductive approach: Established neuronal models can be tested for how pathophysiologically 
plausible alterations in connectivity or neurotransmitter levels (e.g. dopamine is known to be 
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reduced in Parkinson's disease) affect system level activations and behavior. This is essentially 
a bottom-up approach as it involves the study of how known or hypothesized neuronal changes 
affect higher-level functioning. 

• Abductive approach: Computational models can be used to infer neurobiological causes from 
known behavioral differences. In essence, this is a top-down approach which tries to link behav- 
ioral consequences back to underlying latent causes. 

• Quantitative abductive approach: Parameters of a computational model are fit to a subjects' 
behavior on a suitable task or task battery. Different parameter values point to differences in 
underlying neurocircuitry of the associated subject or subject group. These parameters can either 
be used comparatively to study group differences (e.g. healthy and diseased) or as a regressor 
with e.g. symptom severity. This approach is more common with abstract models than with 
neural network models as the former have often fewer parameters and thus can be more easily 
fit to data. 

2.3.1 Case studies in the domain of decision making 

One key area in which computational models have had tremendous success in elucidating how the 
different cognitive and neurobiological gears work together is the domain of decision making. In addi- 
tion, many mental illnesses can be characterized by aberrant decision making of one sort or another 



Maia and Frank 


2011 Wiecki and Frank 


2010 


Montague et al. 2011 



. In the following I will review 



recent cases where computational models of decision making have been used to better understand 
brain disorders. 



Computational models of reinforcement learning 



Parkinson's Disease Our first case study concerns Parkinson's disease (PD). Its most visible 
symptoms affect the motor system as manifest as hypokinesia, bradykinesia, akinesia, rigidity, tremor 
and progressive motor degeneration. However, recently, cognitive symptoms have received increased 



attention (e.g., |Cools[ |2005[ |Frank[ |2005| [Moustafa et al.[ |2008[ [Cunha et al.[|2009[ ). PD is an intrigu- 
ing neuropsychiatric disorder because its pathogenic cascade is well identified to be the cell death of 
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Figure 5: Different approaches computational models can inform mental health research. Given a 
computational model of normal function (a), research can provide a mechanistic bridge from neural 
abnormalities to explain or compare behavioral differences in a deductive approach (b). Contrary, in 
an abductive approach (c) behavioral differences are used to infer underlying neuronal abnormalities. 
Similarly, computational models can be quantitatively fit to behavior to infer underlying causes (i.e. 
quantitative abductive approach; d). See text for more information. Reproduced from Maia and Frank 

(IMTl). 
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midbrain dopaminergic neurons in the substantia nigra pars compacta (SNc), part of the basal ganglia 



(BG) ( |Kish et al.[ |1988[ ). Neural network models of the BG ( |Frank[ |2005[ |2006[ ) interpret this brain 
network as an adaptive action selection device that conditionally gates internal or external actions 
based on their previous reward history. DA is critically involved in learning from rewards and punish- 



ments which actions to facilitate and which actions to suppress in the future (Ljungberg et al. 1992 



[Montague eFaTI [19971 fSchultzj [19981 jWaelti et aT| [200T| [Pan et al.[ [20051 Payer et aL| [20071 |Roeschl 



Montague et al. 


1997 


Schultz et al. 


1997 



ioral reinforcement learning tasks show that the chronic low levels of DA in PD patients result in a 



bias towards learning from punishment at the cost of learning from rewards ( Frank et al. 2004 Cohen 



and Frank 2009). In extension, we have argued that PD is not a motor disorder per se but rather an 



action selection disorder in which the progressive decline of motor and cognitive function can be inter- 



preted in terms of aberrant learning not to select actions (Wiecki and Frank 2010 Wiecki et al. 2009) 



In this case study, an existing biological model of normative brain function was paired with a 
known and well localized neuronal dysfunction to extend our understanding of the symptomatology of 
a brain disorder. Note, however, that the model was not fit to data quantitatively. In the terminology 



established by Maia and Frank (2011), this is an example of the deductive approach in which the 
model provides a mechanistic bridge that explains how abnormal behavior can result from neurocircuit 
dysfunctions. 



Depression Our second case study involves how depression may be understood as an action 
planning disorder. Planning a series of future actions is an exponentially complex problem as each 
individual action can have different outcomes which themselves enable new actions essentially forming 
a decision tree. One way to deal with this complexity is to prune the decision tree and not consider 



certain actions (see figures). Recently, Huys et al. (20121 proposed a model of how humans perform 



this approximation. Briefiy, the authors suggest that actions that would lead to comparatively bad 
outcomes will not be further considered and pruned from the decision tree (note that this is provably 
not optimal as early bad outcomes could be more than accounted for later on) . To test this theory the 
authors test healthy subjects on a novel behavioral task that requires execution of a sequence of actions 
each associated with winning or losing a certain number of points. Critically, in one task condition. 
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Figure 6: Depiction of decision trees with 2 possible actions (pressing of 'U' key (green) or T key 
(yellow) . Each action (starting at the top node) leads to winning or losing points and subsequent action 
choice. After 3 subsequent actions points are aggregated. Critically, subjects experience rewards and 
have to learn the reward structure (which remains stable across a number of trials) to maximize rewards. 
Complete enumeration of a decision tree (left) has exponential complexity and is thus computationally 
infeasible. Thus, the decision tree must be pruned at a certain point. One possible strategy is to prune 



once a large negative reward is encountered (as depicted). Reproduced from Huys et al. (2012). 



subjects had two action paths available to them: (i) one path in which a very large loss occurred early 
that, however, was more than accounted for later on; and (ii) another less favorable one action path 
without an early loss but lower cumulative rewards overall. Participants overwhelmingly chose the 
non-optimal action sequence that did not result in an early large loss suggesting that humans prune 
the decision tree once actions are encountered leading to unfavorable outcomes, even in scenarios where 
this strategy does not result in optimal performance. 



The authors formulate, fit and compare various computational algorithmic models with different 
pruning strategies. The model that provided the best fit (measured with an integrated BIC score 
that penalizes model complexity; see below) had three parameters: specific pruning, general pruning 
and reward sensitivity. Specific pruning represents the probability of a participant to stop evaluating 
sub-trees that are associated with large losses while general pruning represents the probability to stop 
considering sub-trees irrespective of rewards. Reward sensitivity captured the tendency to evaluate a 
loss of e.g. -140 to be of larger absolute magnitude than a win of 140 points although they cancel each 
other (i.e. loss aversion). 

Participants were also given questionaires to assess their (sub-clinical) levels of depression (BDI). 
Intriguingly, this depression rating correlated with the specific pruning parameter of the model (see 
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Figure 7: Correlation between specific pruning parameter A5 and residual depression rating (BDI). 
Each circle represents one subject. Diameter of the circle corresponds to estimation uncertainty. See 



text for further model details. Reproduced from Huys et al. (2012) 



figure [7]). This effect was specific to this parameter as there was no such correlation with the general 
pruning parameter. In other words, subjects with higher depressive ratings were quicker to discard 
plans that lead to bad outcomes early on ~ sometimes missing large rewards available to them later on. 
Finally, the authors speculate that the specific pruning depends on the serotonergic transmitter system. 



In sum, Huys et al. (2012) proposed an algorithmic model of normative cognitive computation. 



By quantitatively fitting the model to behavioral data on a novel task and regressing an independent 
clinical variable (rating on the depression scale) with the fitted parameter values the authors are able 
pinpoint the cognitive computation underlying a clinically significant symptom. In terms of the |Maia| 



and Frank (2011) terminology, Huys et al. (2012) used a quantitative abductive approach 



Schizophrenia Despite SZ being the focus of intense research over the last decades, no single 
theory of its underlying neural causes has been able to explain the diverse set of symptoms that can 
lead to a SZ diagnosis. The symptomatology is structured in terms of positive symptoms like psychosis, 
negative symptoms like anhedonia which refers to the inability to experience pleasure from activities 
usually found enjoyable such as social interaction. 



Recent progress has been made by the application of RL models to understand individual symptoms 



or a single symptom category (e.g. negative symptoms) rather than SZ as a whole (Waltz et al. 2011 



Gold et ah 2008 2012 Strauss et al 2011). 
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In a recent behavioral study using a RL task, Waltz et al. (2007) found that SZ patients show 
reduced performance in selecting previously rewarded stimuli compared with HCs. Moreover, this 
performance deficit is most pronounced in patients with severe negative symptoms. Notably, SZ 
and HC did not differ in their ability to avoid actions leading to negative outcomes. However, this 
behavioral analysis did not allow to differentiate whether SZ patients were impaired at learning from 
positive outcomes or from a failure in representation of the prospective reward values during decision 
making. 

This dichotomy in learning vs representation is also present in two types of RL models - actor-critic 



and Q-learning models (Sutton and Barto 1998). An actor-critic model consists of two modules: an 
actor and a critic. The critic learns the expected rewards of states and trains the actor to perform 
actions that lead to better-than expected outcomes. Q-learning models on the other hand have an 
explicit representation of the outcomes that are associated with each action. Thus, while a Q-learning 
model chooses actions based on their absolute reward values, an actor-critic chooses actions based on 
whether they lead to better-than-expected outcomes. 



In a follow-up study, Gold et al. (2012) administered a new task that paired a neutral stimulus 
in one context with a positive and in another context with a negative stimulus. While the neutral 
stimulus has the same value of zero in both contexts, it is known that DA signals reward prediction 



errors (RPE) that drive learning in the BG are coding outcomes relative to the expected reward ( Sut- 



ton and Barto| 1990 


Barto 


1995 


Montague et al. 


1997 


Schultz et al. 



context, receiving nothing is better than expected and will result in a positive RPE, driving learning 



in the BG to select this action in the future (Frank, 2005). In a test period in which no rewards were 
presented, participants had to choose between an action that had been rewarding and one that had 
simply avoided a loss. Both actions should have been associated with better-than-expected outcomes. 
An actor-critic model should thus show a tendency to select the neutral stimulus while a Q-learning 
model with explicit representation of the reward contingencies should mainly select the one with a 
higher reward. Intriguingly, when both of these models were fit to participant data, the actor-critic 
model produced a better fit for SZ patients with high degree of negative symptoms while HC and SZ 
with low negative symptoms were better fit by a Q-learning model. In other words, the high negative 
symptoms group largely based decisions on learned stimulus-response associations instead of expected 
reward values. Notably, HC and the low negative symptom group did not differ significantly in their 
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RL behavior. Moreover, this also rules out possibly confounding effects of antipsychotic drugs, as both 
patient groups were similarly medicated. 



In a related line of work, Strauss et al. (2011 ) tested HC and SZ patients on a reinforcement learn- 



ing task that allowed subjects to either adopt a safe strategy and exploit the rewards of actions with 
previously experienced rewards, or, to explore new actions with perhaps even higher payoffs. |Frank| 



et al. (2009) develop a computational model that can be fit to subjects' behavior and recover how 



individual subjects balance this exploration-exploitation trade-off. Intriguingly, applying this model 



to SZ patients, Strauss et al. (2011 ) found that patients with high anhedonia ratings where less willing 



to explore their environment and uncover potentially better actions. This result suggests a reinterpre- 
tation of the computational cognitive process underlying anhedonia. For example, one might assume 
that the lack of engagement of social activities of anhedonistic patients results from an inability to 
experience pleasure and as a consequence, a failure to learn the positive value of social interaction. 
Instead, this study suggests that anhedonia is a result of an inability to consider the prospective benefit 
of doing something that might lead to better outcomes. 



In sum, Gold et al. (2012) and Strauss et al. (2011) used a quantitative abductive approach to infer 



aberrant computational cognitive processes in RL in a subgroup of SZ patients. By grouping subjects 
according to symptom severity instead of diagnosis the authors addressed the problem of heterogeneity. 



Moreover, linking the results to prior neural network modeling efforts (Frank and Glaus 2006), as 



described above, also points to the OFG as a promising target for further investigation as a neural 
source of negative symptoms in SZ patients. 



Computational models of response inhibition Besides RL, response inhibition is another widely 
studied phenomenon in cognitive neuroscience of relevance to mental illness. Response inhibition is 
required when actions in the planning or execution stage are no longer appropriate and must be 
suppressed. The antisaccade task is one such task that is often used in a psychiatric setting (e.g 



Aichert et al. 2012 Fukumoto-Motoshita et al. 2009). It requires subjects to inhibit a prepotent 



response to a salient stimulus and instead saccade to the opposite side (Hallett 1979). A wealth 



of literature has demonstrated reduced performance of psychiatric patients with disorders including 
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attentiondeficit/hyperactivity disorder (ADHD) ( Nigg[ 2001 Oosterlaan et al. 1998 Schachar and 



Logan[ 19901, obsessive compulsive disorder (OCD) (Chamberlain et al. 2006 Menzies et al. 2007 



Penades et~al| |2007[ [Morein-Zamir et al.[ |2009| ), schizophrenia (SZ) ( |Huddy et aLj |2009[ [Bellgrove 



et al. 2006 Badcock et al. 2002), Parkinson's disease (PD) (van Koningsbruggen et al. 2009) and 



substance abuse disorders (Monterosso et al. 2005 Nigg et al. 2006). However, as demonstrated by 



Wiecki and Frank (2013), even a supposedly simple behavioral task such as the antisaccade task re- 



quires a finely orchestrated interplay between various brain regions including frontal cortex and basal 
ganglia. It thus can not be said that decreased accuracy in this task is evidence of response inhibition 
deficits per se as the source of this performance impairment can be manifold. 

In follow-up work, we have formulated a psychological process model which summarizes the higher- 
level computations of the neural network and has fewer parameters. In an attempt to bridge these two 
levels of abstraction, we fit this process model to the outputs of this neural network model for which 



the biological modulations can be tightly controlled (Wiecki and Frank 2010). Interestingly, by mod 



ulating different biological parameters in the neural network model and recovering which parameter of 
the process model was affected by this modulation, we were able to associate high-level computational 
processes with their neural correlates. The hope is that these associations, once validated, allow us to 
infer specific neural aberrations from behavioral performance. 



In sum, computational models like the DDM allow mapping of behavior to psychological processes 
and could thus be categorized as the computational abductive approach. However, ambiguity of how 
psychological processes relate to the underlying neurocircuitry still have to be disambiguated. By com- 
bining different levels of modeling these ambiguities can be better identified and studied. Ultimately, 
this might allow development of tasks that use specific conditions (e.g. speed-accuracy trade-off, reward 
modulations and conffict) to disambiguate the mapping of psychological processes to their neurocir- 
cuitry. Using biological process models to test different hypotheses about the behavioral and cognitive 
effects of neurocircuit modulations would correspond to the deductive approach. In other words, by 



combining the research approaches outlined by Maia and Frank (2011) we can use our understanding 



of the different levels of processing to inform and validate how these levels interact in the healthy and 
dysfunctional brain. 
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In sum, there arc a few example studies which apphed estabHshed computational models to identify 
model parameters (which aim to describe a single cognitive function) and relate it to the severity of 
a specific clinical symptom. In the following, I will review sequential sampling models and Bayesian 
methods and show how they are well suited to solve many of the issues encountered in computational 
psychiatry. 

3 Part III: Quantitative Methods 

Part I described the current issues in psychiatry. Several examples presented in part II highlighted 
computational psychiatry as an area with a lot of potential to solve these problems. In part III 
I will review several quantitative methods that can solve the problems associated with quantifying 
cognitive function. Specifically, sequential sampling models present a versatile tool to model cognitive 
function. Fitting these models to data - especially with the small number of trials often found in clinical 
experiments - is a challenge well addressed by hierarchical Bayesian models that share statistical power 
by assuming similarity between subjects. However, there are two limitations with this basic approach, 
(i) Traditional inference methods require a likelihood function which is often intractable for more 
nuanced formulations of sequential sampling models. Likelihood free methods solve this problem as 
they only require a generative process from which the likelihood is estimated, (ii) While the similarity 
assumption made by hierarchical Bayesian modeling is reasonable we can not know the exact form of 
this similarity ahead of time. Bayesian mixture models address this problem by inferring clusters from 
the data. While traditional methods like Gaussian Mixture Models require specification of the number 
of clusters to be found in the data ahead of time, Bayesian non-parametrics relax this restriction and 
infer the number of clusters from the data. In the following I will focus on how each quantitative method 
helps to solve the above mentioned issues. Mathematical details can be found in the appendix. Finally, 
while these methods are described with the motivation of estimating sequential sampling models, the 
Bayesian methods are applicable more broadly (e.g. to the estimation of RL models). 
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3.1 Sequential Sampling models 



Cognition spans many mental processes that include attention, social cognition, memory, emotion, and 
reasoning, to name a few. As outlined above, RL models have already proven to be a valuable tool in 
explaining neuropsychological disorders and their symptoms. A computational psychiatric framework 
that aims to explain the multi-faceted domain of mental illness must thus include computational cog- 
nitive neuroscience models that cover a broad range of cognitive processes. As such a review is outside 



of our scope (but see O'Reilly and Munakata (e.g. 2000)) I will focus on sequential sampling models as 
an illustrative example for how these models work, how they have been applied to study normal and 
aberrant neurocognitive phenomena, how they can be fit to data using Bayesian estimation and how 
subgroups of subjects can be inferred using mixture models. 



Sequential sampling models (SSMs) (e.g. Townsend and Ashby 1983 1 have established themselves 



as the de-facto standard for modeling data from simple decision making tasks (e.g. Smith and Ratcliff 



2004 1 . Each decision is modeled as a sequential extraction and accumulation of information from the 



environment and/or internal representations. Once the accumulated evidence crosses a threshold, a 
corresponding response is executed. This simple assumption about the underlying psychological pro- 
cess has the intriguing property of reproducing reaction time distributions and choice probability in 
simple two-choice decision making tasks. 



SSMs generally fall into one of two classes: (i) diffusion models which assume that relative evidence 
is accumulated over time and (ii) race models which assume independent evidence accumulation and 



response commitment once the first accumulator crossed a boundary (e.g. LaBerge 1962 Vickers 



1970 1. While there are many variants of these models they are often closely related on a computational 



level and sometimes mathematically equivalent under certain assumptions (Bogacz et al. 2006). As 



such, I will restrict discussion to two exemplar models from each class widely used in the literature: 
the drift diffusion model (DDM) ( [Ratcliff and Rouder[ [l998{ [Ratcliff and McKoonl |2008[ ) belonging to 



the class of diffusion models and the linear ballistic accumulator (LB A) ( Brown and Heathcote 2008 ) 
belonging to the class of race models. 
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3.1.1 Drift DiflFusion Model 

The DDM models decision making in two-choice tasks. Each choice is represented as an upper and 
lower boundary. A drift-process accumulates evidence over time until it crosses one of the two bound- 



aries and initiates the corresponding response (Ratcliff and Rouder 1998 Smith and Ratclift 2004) 



The speed with which the accumulation process approaches one of the two boundaries is called the 
drift rate and represents the relative evidence for or against a particular response. Because there is 
noise in the drift process, the time of the boundary crossing and the selected response will vary between 
trials. The distance between the two boundaries (i.e. threshold) influences how much evidence must 
be accumulated until a response is executed. A lower threshold makes responding faster in general 
but increases the influence of noise on decision making while a higher threshold leads to more cautious 
responding. Reaction time, however, is not solely comprised of the decision making process - per- 
ception, movement initiation and execution all take time and are summarized into one variable called 
non-decision time. The starting point of the drift process relative to the two boundaries can influence 
if one response has a prepotent bias. This pattern gives rise to the reaction time distributions of both 
choices (see figure [s] mathematical details can be found in the appendix) . 

Later on, the DDM was extended to include inter-trial variability in the drift-rate, the non-decision 
time and the starting point in order to account for two phenomena observed in decision making tasks 



early and late errors. Models that take this into account are referred to as the full DDM (Ratcliff 



and Rouder 1998). 



3.1.2 Linear Ballistic Accumulator 



The Linear Ballistic Accumulator (LBA) model belongs to the class of race models ( Brown and Heath- 



cote 



2008). Instead of one drift process and two boundaries, the LBA contains one drift process 
for each possible response with a single boundary each. Thus, the LBA can model decision making 
when more than two responses are possible. Moreover, unlike the DDM, the LBA drift process has 
no intra-trial variance. RT variability is obtained by including inter-trial variability in the drift-rate 
and the starting point distribution (see figure [9]). Note that the simplifying assumption of a noiseless 
drift-process simplifies the math significantly leading to a computationally faster likelihood function 
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1 2 3 4 5 

time (sees) 

Figure 8: Trajectories of multiple drift-processs (blue and red lines, middle panel). Evidence is ac- 
cumulated over time (x-axis) with drift-rate v until one of two boundaries (separated by threshold 
a) is crossed and a response is initiated. Upper (blue) and lower (red) panels contain histograms 
over boundary-crossing-times for two possible responses. The histogram shapes match closely to that 
observed in reaction time measurements of research participants. 
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Figure 9: Two linear ballistic accumulators (left and right) with different noiseless drifts (arrows) 
sampled from a normal distribution initiated at different starting points sampled from uniform distri- 
bution. In this case, accumulator for response alternative 1 reaches criterion first and gets executed. 
Because of this race between two accumulators towards a common threshold these model are called 



race-models. Reproduced from Donkin et al. (20111 



for this model. 

In a simulation study it was shown that the LBA and DDM lead to similar results as to which 



parameters are affected by certain manipulations (Donkin et al. 2011). 



3.1.3 Relationship to cognitive neuroscience 

SSMs were originally developed from a pure information processing point of view and primarily used 
in psychology as a high-level approximation of the decision process. More recent efforts in cognitive 
neuroscience have simultaneously (i) validated core assumptions of the model by showing that neurons 



indeed integrate evidence probabilistically during decision making ( Smith and Ratcliff 2004 Gold and 



Shadlen 2007) and (ii) applied this model to understand and describe neural correlates of cognitive 



processes (e.g. Forstmann et al. 2010a Cavanagh et al. , 2011). 



Multiple routes to decision threshold modulation have been identified. Decision threshold in the 
speed-accuracy trade-off is modulated by changes in the functional connectivity between pre-SMA and 



striatum (Forstmann et al. 2010a). Neural network modeling (Frank 2006 Ratcliff and Frank 2012) 
validated by studies of PD patients with a deep-brain-stimulator (DBS) in their subthalamic nucleus 
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(STN) (Frank et al. 2007a) suggest that this node is imphcated in raising the decision threshold 



when there is conflict between two options associated with similar rewards. This result was further 



corroborated by Cavanagh et al. (2011) who found that frontal theta power (as measured by elec- 



troencelophagraphy and thought to correspond to conflict (Cavanagh et al. 2012)) is correlated with 
decision threshold increase on a trial-by-trial basis. As predicted, this relationship was broken in PD 
patients with DBS turned on (but, critically, not when DBS was turned off thus showing the effect 
is not a result of the disease). In other words, by interfering with STN function through stimulation 
we were able to show that this brain area is causally involved in decision threshold modulation de- 
spite intact experience of conflict (as measured by theta power). Interestingly, these results provide a 
computational cognitive explanation for the clinical symptom of impulsivity observed in PD patients 



receiving DBS iFrank et al. 2007a). 



3.1.4 Application to computational psychiatry 

Despite its long history, the DDM has only recently been applied to the study of psychopathology. For 
example, threat/no-threat categorization tasks (e.g. "Is this word threatening or not?") are used in 
anxiety research to explore biases to threat responses. Interestingly, participants with high anxiety are 
more likely to classify a word as threatening than low anxiety participants. One hypothesis assumes 
that this behavior results from an increased response bias towards threatening words in anxious people 



(Becker and Rinck 2004 Manguno-Mire et al. 2005 Windmann and Kriiger 1998). Using DDM 



analysis, [White (20091 showed that instead of a response bias (or a shifted starting-point in DDM 



terminology), anxious people actually showed a perceptual bias towards classifying threatening words 
as indicated by an increased DDM drift-rate. 



In a recent review article. White et al. (2010) use this case-study to highlight the potential of the 



DDM to elucidate research into mental disease. Note that in this study the authors did not attempt to 
hypothesize about the underlying neural cause of this threat-bias. While there is some evidence that 



bias in decision making is correlated with activity in the parietial network (Forstmann et al. 2010b) 



this was not tested in respect to threatening words. Ultimately, I suggest that this research strategy 
should be applied to infer neural correlates of psychological DDM decision making parameters using 
functional methods like fMRI to the study neuropsychopathology (as outlined above). 
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The DDM has also been successfully used to show that ADHD subjects were less able to raise 



their decision threshold when accuracy demands were high (Mulder et al. 20101. Interestingly, the 
amount by which ADHD subjects failed to modulate their decision threshold correlated strongly with 
patients' impulsivity/hyperactivity rating. Moreover, this correlation was specific to impulsivity as no 
correlation was found between decision threshold modulation and inattentiveness. 

In sum, SSMs show great promise as a tool for computational psychiatry. However, their applicabil- 
ity depend on the ability to accurately estimate them to construct individual CMDFs. In the following 
I will review different parameter estimation techniques. Special focus will be given to Bayesian meth- 
ods. Finally, once SSMs can be fit accurately the question arises how to construct a map of mental 
illness based on these CMDF. Towards this goal I will review clustering methods that can be expressed 
in the Bayesian framework. 

3.2 Parameter estimation 

To identify computational biomarkers in a variable clinical population with the DDM it is critical to 
have robust and sensitive estimation methods. In the following I will describe traditional parameter 
estimation methods and their pitfalls. I will then argue how Bayesian estimation provides a complete 
framework that avoids these pitfalls. 

3.2.1 Maximum Likelihood and 

Traditionally, fitting of computational models is treated as an optimization problem in which an ob- 
jective function is minimized. Different objective functions for the DDM have been proposed. Most 
common is the quantile method which calculates the quantiles of the DDM likelihood and uses the 



statistic as the objective to compare it to the quantiles observed in the data (Ratcliff and Tuerlinckx 



2002 1 . While this approach requires a fair amount of data to get stable quantile estimation it naturally 
deals with outlier RTs as the quantiles average extreme values out. In comparison, maximimum likeli- 
hood estimation (MLE) directly uses the DDM likelihood function derived from the DDM generative 



process (Navarro and Fuss 2009). 
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Psychological experiments often test multiple subjects on the same behavioral task. Models are 
then either fit to individual subjects or to the aggregated group data. Both approaches are not ideal. 
When models are fit to individual subjects we neglect any similarity the parameters are likely to have. 
While we do not necessarily have to make use of this property to make useful inference if we have lots 
of data, the ability to infer subject parameters based on the estimation of other subjects generally leads 



to more accurate parameter recovery ( Sofer et al. I in cases where little data is available as is often 



the case in clinical and ncurocoognitive experiments. One alternative is to aggregate all subject data 
into one meta-subject and estimate one set of parameters for the whole group. While useful in some 
settings, this approach is unsuited for the setting of computational psychiatry as individual differences 
play a huge role. 

3.2.2 Hierarchical Bayesian models 

Statistics and machine learning have developed efficient and versatile Bayesian methods to solve various 



inference problems ( Poirier 2006 ) . More recently, they have seen wider adoption in applied fields such 



as genetics ( Stephens and Balding 2009) and psychology (e.g. Clemens et al. , 20111. One reason for this 
Bayesian revolution is the ability to quantify the certainty one has in a particular estimation. Moreover, 
hierarchical Bayesian models provide an elegant solution to the problem of estimating parameters of 
individual subjects outlined above. Under the assumption that participants within each group are 
similar to each other, but not identical, a hierarchical model can be constructed where individual 



parameter estimates are constrained by group- level distributions (Nilsson et al. 2011 ShifFrin et al. 



20081. 



Bayesian methods require specification of a generative process in form of a likelihood function that 
produced the observed data x given some parameters 9. By specifying our prior belief we can use 
Bayes formula to invert the generative model and make inference on the probability of parameters 0: 



p{e\x) 



p{x\e)*p{e) 
W) 



(1) 



Where P{x\9) is the likelihood and P{0) is the prior probability. Computation of the marginal 
likelihood P{x) requires integration (or summation in the discrete case) over the complete parameter 
space 6: 
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p(x) = / p{x\e)de (2) 

Je 

Note that in most scenarios this integral is analytically intractable. Sampling methods like Markov- 
Chain Monte Carlo (MCMC) ( [Gamerman and Lopes 2006) circumvent this problem by providing a 
way to produce samples from the posterior distribution. These methods have been used with great 



success in many different scenarios (Gelman et al. 20031 and will be discussed in more detail below. 

Another nice property of the Bayesian method is that it lends itself naturally to a hierarchical 
design. In such a design, parameters for one distribution can themselves come from a different distri- 
bution which allows chaining together of distributions of arbitrary complexity and map the structure 
of the data onto the model. 

This hierarchical property has a particular benefit to cognitive modeling where data is often scarce. 
We can construct a hierarchical model to more adequately capture the likely similarity structure of 
our data. As above, observed data points of each subject Xi^j (where i = 1,... ,Sj data points per 
subject and j ~ 1, . . . , N foi N subjects) are distributed according to some likelihood function f\9. 
We now assume that individual subject parameters 9j are normal distributed around a group mean 
with a specific group variance (A — {n,<j) with hypcrprior Gq) resulting in the following generative 
description: 



fi.a^GoO (3) 



^^{^l,a') (4) 

fiOj) (5) 



See figure 10 for the corresponding graphical model description. 

Another way to look at this hierarchical model is to consider that our fixed prior on 9 from formula 
([!]) is actually a random variable (in our case a normal distribution) parameterized by A which leads 
to the following posterior formulation: 

Pi9,Mx) = ^m^^P^l^ (6) 
P{x) 
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i — 1, . . . , Sj 



Figure 10: Graphical notation of a hierarchical model. Circles represent continuous random vari- 
ables. Arrows connecting circles specify conditional dependence between random variables. Shaded 
circles represent observed data. Finally, plates around graphical nodes mean that multiple identical, 
independent distributed random variables exist. 

Note that we can factorize P{x\9) and P{9\X) due to their conditional independence. This formu- 
lation also makes apparent that the posterior contains estimation of the individual subject parameters 
0j and group parameters A. 

Several inference methods to estimate the posterior distribution have been developed. For details 
on commonly used sampling algorithms, see the appendix. 

In sum, hierarchical Bayesian estimation leverages similarity between individual subject to share 
statistical power and increase sensitivity in our parameter estimation. However, note that in our 
computational psychiatry application the homogeneity assumption that all subjects come from the 
same normal distribution is almost certainly violated (see above). To deal with the heterogeneous 
data often encountered in psychiatry I will discuss mixture models further down below. Next, I will 
describe algorithms to estimate this posterior distribution. 

3.2.3 Likelihood-free methods 

Most models in cognitive neuroscience and mathematical psychology like the DDM are described by 
a latent generative process (see e.g. ([7|). However, this generative description is usually ill suited 
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for inference as it does not directly provide us with a closed-form likelihood function p{x\6) of how 
observed data (e.g. the wiener first passage time; see above) arise from this generative process. 



While the DDM is used partly because it has a tractable likelihood functions, many interesting 
considerations require models for which a generative process but no tractable likelihood can be speci- 



fied. Recent examples of these efforts include changes of mind as new evidence is processed ( Resulaj 



et al. 2009 ), the influence of attention (Krajbich and Rangel 2011 ), and reward-based decision making 



given conflict in values of alternative actions (Cavanagh et al. 2011 Ratclifi^ and Frank 2012). In these 
cases, a likelihood function must instead be approximated by simulation from the generative process. 
This type of inference is commonly called likelihood free. 



It can be expected that the full spectrum of cognitive function will be relevant in computational 
psychiatry. Being able to formulate (and estimate) drift-diffusion models very flexibly potentially 
allows its application to new domains. 

For a detailed explanation of these methods, see the appendix. 

3.3 Model selection 

Computational models often allow formulation of several plausible accounts of cognitive behavior. We 
anticipate that this problem will also occur in computational psychiatry where multiple theories of 
cognitive dysfunction must be tested. One way to differentiate between these various plausible hy- 
pothesis as expressed by alternative models is model comparison. In the following I will review various 
methods and metrics to compare hierarchical models. The most critical property for model comparison 
is that model complexity gets penalized because more complex models have greater explanatory power 
by design. Several model comparison measures have been devised. I refer the reader to the appendix 
section for mathematical details. 

3.4 Mixture Models 

In this section I will review different mixture models that allow estimation of clusters in data in 
a Bayesian framework. These are relevant to our objective as they (i) deal with the heterogeneity 
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encountered in computational psychiatry and (ii) have the potential to bootstrap a new psychiatric 
classification system based on measurable, quantitative, computational endophenotypes. Because we 
are describing a toolbox using hierarchical Bayesian estimation techniques I will focus this section on 
mixture models as they are easily integrated into this framework. Where possible, I will highlight 
connections to more traditional clustering methods like k-means. 

3.4.1 Gaussian Mixture Models 

GMMs assume parameters to be distributed according to one of several Gaussian distributions (i.e. 
clusters). Specifically, given the number of clusters fc, each cluster mean and variance gets estimated 
from the data. This type of model is capable of solving our above identified problem of assuming 
heterogeneus subjects to be normally distributed: by allowing individual subject parameters to be 
assigned to different clusters we allow estimation of different sub-groups in our patient and healthy 
population. Note, however, that the number k of how many clusters should be estimated must be 
specified a-priori in a GMM and remain fixed for the course of the estimation. This is problematic as 
we do not necessarily know how many sub-groups to expect in advance. As we will see below, Bayesian 
non-parametrics solve this issue by inferring the number of clusters from data. 

3.4.2 Dirichlet Process Gaussian Mixture Models 

Dirichlet processes Gaussian mixture models (DPGMMs) belong to the class of Bayesian non-parametrics 



(Antoniak 1974 1. They can be viewed as a variant of GMMs with the critical difference that they as- 



sume an infinite number of potential mixture components (see Gershman and Blei (2012 1 for a review) 



Mixture models can infer sub-groups when the data is heterogeneous as is generally the case in 
patient populations. While the mindset describing these methods was their application towards the 
SSM their applicability is much more general than that. For example, the case-studies described above 
which used, among others, RL models to identify differences between HC and psychiatric patients could 
easily be embedded into this hierarchical Bayesian mixture model framework I outlined here. There 
are multiple benefits to such an approach. First, computational models fitted via hierarchical Bayesian 
estimation provide a tool to accurately describe the neurocognitive functional profile of individuals. 
Second, the mixture model approach is ideally suited to deal with the heterogeneity in patients but 
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also healthy controls (Fair et al. 2012 1. Third, by testing psychiatric patients with a range of diagnoses 



(as opposed to most previous research studies that only compare patients with a single diagnosis, e.g. 



SZ, to controls) we might be able to identify shared pathogenic cascades as suggested by Buckholtz 



and Meyer-Lindenberg (2012 1. 



4 Conclusions 



As outlined above, computational psychiatry is an emerging field that shows great promise to under- 
stand aberrant biological processes in mental disease and address current challenges encountered in 
mental health research. By fitting computational models to behavioral data we can construct com- 
putational multi-dimensional features to replace symptom-based classification as implemented by the 
DSM. Decision making appears to provide a good framework for studying psychiatric disease as many 
disorders show abnormalities in core decision making processes. Sequential sampling models have a 
good track record in describing individual differences and can be linked to neuronal processes. Hi- 
erarchical Bayesian estimation provides a compelling toolbox to fit these models directly to data as 
it (i) provides an uncertainty measure; (ii) allows estimation of individual and group-level parame- 
ters simultaneously; (iii) allows for direct model comparison; (iv) can be used in scenarios where a 
likelihood can not be easily formulated; and (v) enable deconstruction of symptoms by identifying 
latent clusters. For example, impulsivity is a core symptom of impulse control disorders like ADHD, 



OCD, Tourette syndrome, substance abuse and eating disorder (Robbins et al. 2012). Computational 



cognitive models have already started to deconstruct this broadly defined behavioral symptom and 



identified separate pathways that can all lead to alterations in impulse control (Dalley et al. 2011 1 in- 



cluding reduced motor inhibition ( Chamberlain et al. 2006 2008 ) early temporal discounting of future 



rewards, insensitivity towards negative relative to positive outcomes (Frank et al. 2007b Cockburn 



and Holroyd 2010), or an inability to adjust the decision threshold appropriately (Mulder et al. 2010 



Cavanagh et al. 2011 Frank et al. 2007a). Ultimately, the hope is to find a way to describe and 



diagnose psychiatric disease based on objective computational neurocognitive markers rather than the 
current subjective symptom-based approach. I believe that this combination of computational tools 
described here-within is powerful enough to lead the charge towards a new level of understanding of 
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mental illness based on identifiable and reproducible neurocognitive computational multi-dimensional 
features. 



5 Appendix 

The following serves as a reference for the mathematical details of the methods motivated above. 
5.1 Drift-Diffusion Model 

Mathematically, the DDM is defined by a stochastic differential equation called the Wiener process 
with drift: 

dWr^N{v,(j^) (7) 

where v represents the drift-rate and a the variance. As we often only observe the response times 
of subjects we are interested in the wiener first passage time (wfpt) - the time it takes W to cross 
one of two boundaries. Assuming two absorbing boundaries of this process and through some fairly 



sophisticated math (see e.g. Smith 20001 it is possible to analytically derive the time this process 
will first pass one of the two boundaries (i.e. the wiener first passage time; wfpt). This probability 
distributior0 then serves as the likelihood function for the DDM. 



5.2 Bayesian Inference 

5.2.1 Empirical Bayesian Approximation 

Empirical Bayes can be regarded as an approximation of equation ([6]). To derive this approximation 
consider P{6\x) which we can calculate by integrating over P(A): 

P(eW = ^^ / ^(^IWA)dA (8) 

Now, if the true distribution P{0\\) is sharply peaked, the integral can be replaced with the point 
estimate of its peak A*: 

^the wfpt will not be a distribution rather than a single value because of the stochasticity of the wiener process 
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p{e\x) 



p{x\e)p{e\x*) 



(9) 



P{x\\*) 

Note, however, that A* depends itself on P{9\x). One algorithm to solve this interdependence is 



Expectation Maximization (EM) ( Dempster et al.[ 1977). EM is an iterative algorithm that alternates 
between computing the expectation of P{6\x) (this can be easily done by Laplace Approximation 



( Azevedo-filho and Shachter 1994)) and then maximizing the prior point estimate A* based on the 
current values obtained by the expectation step. This updated point estimate is then used in turn to 
recompute the expectation. The algorithm is run until convergence or some other criterion in reached. 



This approach is used for example by Huys et al. (2012) to fit their reinforcement learning models. 



5.2.2 Markov-Chain Monte-Carlo 

As mentioned above, the posterior is often intractable to compute analytically. While Empirical Bayes 
provides a useful approximation, an alternative approach is to estimate the full posterior by drawing 
samples from it. One way to achieve this is to construct a Markov-Chain that has the same equilib- 



rium distribution as the posterior (Gamerman and Lopes 2006). Algorithms of this class are called 



Markov-Chain Monte Carlo (MCMC) samplers. 



One common and widely applicable algorithm is Metropolis-Hastings (Chib and Greenberg 1995 



Andrieu et al. 2003). Assume we wanted to generate samples 9 from the posterior p{9\x). In general, 
we can not sample from p{9\x) directly. Metropolis-Hastings instead generates samples 0* from a 
proposal distribution q{d*\9*^^) where the next position 6* only depends on the previous position 
at 9*~^ (i.e. the Markov-property). For simplicity we will assume that this proposal distribution is 



symmetrical; i.e. q{9'^\9'^ ^) 
distribution, formally: 



^\6*). A common choice for the proposal distribution is the Normal 



'Af{9*-\a^) 



The proposed jump to 9* is then accepted with probability a: 



(10) 



. p{9'\x) . 
a = mm 1, 

p(0^ '-\x) 



(11) 
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In other words, the probabihty of accepting a jump depends on the probabihty ratio of the proposed 
jump position 0* to the previous position 9*^^. Critically, in this probability ratio, the intractable 
integral in the denominator (i.e. p{x) = J p{x\9) dO) cancels out. This can be seen by applying Bayes 
formula ([T]): 

Pi^\0'-^'M0'-') p{x\e^-^)p{9t-^) ^ ' 

Thus, to calculate the probability of accepting a jump we only have to evaluate the likelihood and 
prior, not the intractable posterior. 

Note that 0" has to be initialized at some position and can not directly be sampled from the poste- 
rior. From this initial position, the Markov chain will explore other parts of the parameter space and 
only gradually approach the posterior region. The first samples generated are thus not from the true 
posterior and are often discarded as "burn-in" . Note moreover that once the algorithm reaches a region 
of high probability it will continue to explore lower probability regions in the posterior, albeit with 
lower frequency. This random-walk behavior is due to the probability ratio a which allows Metropolis- 
Hastings to also sometimes accept jumps from a high probability position to a low probability position. 

Another common algorithm is Gibbs sampling that iteratively updates each individual random 



variable conditional on the other random variables set to their last sampled value (e.g Frey and Jojic 



20051. Starting at some configuration 9^ , the algorithm makes T iterations over each random variable 
9i. At each iteration t each random variable is sampled conditional on the current (t — 1) value of all 
other random variables that it depends on: 

e\-p{ef^\0t^l^) (13) 

Critically, of^^ are treated as constant. The sampled value of 9^^ will then be treated as fixed 
while sampling the other random variables. 

Note that while Gibbs sampling never rejects a sample (which often leads to faster convergence 
and better mixing), in contrast to Metropolis-Hastings, it does require sampling from the conditional 
distribution which is not always tractable. 
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5.3 Likelihood free methods 



Several likelihood-free methods have emerged in the past (for a review, see Turner and Van Zandt 



(2012 1). Instead of an analytical solution of the likelihood function, these methods require a sampling 
process that can simulate a set of data points from a generative model for each 0. We will call the 
simulated data y and the observed data x. Approximate Bayesian Computation (ABC) relies on a 
distance measure p{x, y) that compares how similar the simulated data y is to the observed data x 
(commonly, this distance measure relies on summary statistics). We can then use the Metropolis- 



Hastings algorithm introduced before and change the acceptance ration a (111 to use p{x,y) instead 
of a likelihood function. 

plf^-^) ) ifp(a;,y)<eo ^^^^ 
if p{x, y) > eo 

where eo is an acceptance threshold. Large eo will result in higher proposal acceptance probability 
but a worse estimation of the posterior while small eo will lead to better posterior estimation but 
slower convergence. 

An alternative approach to ABC is to construct a synthetic likelihood function based on summary 



statistics (Wood 2010). Specifically, we sample Nr multiple data sets yi....,Nr from the generative 



process. We then compute summary statistics si,...,Ar^ for each simulated data sei[^ Based on these 



summary statistics we then construct the synthetic likelihood function to evaluate 9 (see figure 11 for 
an illustration): 

pix\e)^MiSix);fie,^e) (15) 

This synthetic likelihood function based on summary statistics can then be used as a drop-in 
replacement for e.g. the Metropolis-Hastings algorithm outlined above. 



^The summary statistics must (i) be suiHcient and (ii) normally distributed 
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Figure 11: Construction of a synthetic likelihood. To evaluate parameter vector 0, Nr data sets 
yi,...,Nr s-re sampled from the generative model. On each sampled data set summary statistics si^,,,^Nr 
are computed. Based on these summary statistics a multivariate normal distribution is approximated 
with mean pig and covariance matrix Tig. The likelihood is approximated by evaluating summary 
statistics of the actual data on the log normal distribution with the estimated iig and Tig. Reproduced 



from (Wood 2010) 
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5.4 Model Comparison 

5.4.1 Deviance Information Criterion 

The Deviance Information Criterion (DIG) is a measure wiiicli trades off model complexity and model 



fit ( Spiegelhalter et al. 2002). Several similar measures exist such as Akaike Information Criterion 
(AIC) and the Bayesian Information Criterion (BIC). However, both these measures use the number 
of parameters as a proxy for model complexity. While a reasonable approximation to the complexity 
of non-hierarchical models, the relationship between model parameters (some of which are latent) and 
complexity in hierarchical models is more intricate. The DIC measure instead infers the number of 
parameters from the posterior. The DIC is computed as follows: 



Die = D+pD (16) 

where 



pD^D-D (17) 

D is the posterior mean of the deviance (i.e. —2 * log(likelihood)) and £) is a point estimate of the 
deviance obtained by substituting in the posterior means. Loosely, D represents how well the model 
fits the data on average while D captures the deviance at the best fitting parameter combination. pD 
then acts as a measure related to the posterior variability and used as a proxy for the effective number 
of parameters. Complex models with many parameters will tend to have higher posterior variability 
and thus result in increased pD penalization. 

Note that the only parameters that affect D directly in our hierarchical model (equation [g]) are the 
subject parameters 9i. Thus, DIC estimates model fit based on how well individual subjects explain 
the observed data. 

5.4.2 BIC 

The Bayesian Information Criterion (BIC) is defined as follows: 
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BIG = -2 * logp(a;|6'*^^) + k* log(n 



(18) 



where k is the number of free parameters, n is the number of data points, x is the observed data 



and logp{x\k) is the hkehhood of the parameters given the data (Schwarz 1978). 

While BIG can not directly be applied to hierarchical models (as outlined above), it is possible to 



integrate out individual subject parameters (e.g. Huys et al. 2012): 



logp(a;|0"^") = / d/i (19) 

where Xi is the data belonging to the ith subject. The resulting score is called integrated BIG. 



Since the subject parameters are integrated out, integrated BIG estimates how well the group 
parameters are able to explain the observed data. 

5.4.3 Bayes Factor 



Another measure to compare two models is the Bayes Factor (BF) (Kass and Raftery 1995). It is 



defined as the ratio between the marginal model probabilities of the two models: 



BF = 



p{x\Mi) _ fp{ei\Mi)p{x\ei,M,)d0i 



(20) 



p{x\M2) J pie2\M2)p{x\e2,M2)de2 

The magnitude of this ratio informs the degree one should belief in one model compared to the other 



As BF integrates out subject and group parameters this model comparison measure should be used 
when different classes of models are to be compared in their capacity to explain observed data. 
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5.5 Mixture Models 

5.5.1 Gaussian Mixture Models 

Mixture models infer k number of clusters in a data set. The assumption of normal distributed clusters 
leads to a Gaussian Mixture Model (GMM) with a probability density function as follows: 

K 

p{x\tt,iii^...^k,cti,...,k) = ^TrkAf{xi\fik,crl) (21) 

fc=i 

Each observed data point Xi can be created by drawing a sample from the normal distribution 
selected by the unobserved indicator variable zt which itself is distributed according to a multinomial 
distribution tt: 

/^fe,«^fe~Go() (22) 

Zir^W (23) 

Xir^J\fin,^,al) (24) 

where the base measure Gq defines the prior for /i^ and (Tk- To simplify the inference it is often 
advisable to use a conjugate prior for these paramters. For example, the normal distribution is the 
conjugate prior for a normal distribution with known variance: 



/Ufe j\^(/io,o-o) (25) 
In a similar fashion, we can assign the mixture weights a symmetric Dirichlet prior: 



^~Dir(-|,...,-|) (26) 

Note that the GMM assumes a mixture distribution on the level of the observed data Xi. However, 
in our relevant case of a multi-level hierarchical model we need to place the mixture at the level 
of the latent subject parameters instead of the observed data. As before, we use the subject index 
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fik,<7k -GoO (27) 

TT - Dir(a) (28) 

Zj ^ Categorical(7r) (29) 

0,^U{fi.^,al) (30) 

X,., ^ (31) 



Where f denotes the Hkelihood function. 



Interestingly, the famous K-Means clustering algorithm is identical to a Gaussian Mixture Model 
(GMM) in the limit ct^ — > (Kulis et al. , 2012 1. K-Means is an expectation maximization (EM) 
algorithm that alternates between an expectation step during which data points are assigned to their 
nearest cluster centroids and a maximization step during which new cluster centroids are estimated. 
This algorithm is repeated until convergence is reached (i.e. no points are reassigned to new clusters). 



5.5.2 Dirichlet Process Gaussian Mixture Models 

oo 

p{x\ ) = ^nk^^{x,\^lk,(Tl) (32) 

k=l 

As above, we specify our generative mixture model: 



pifc,crfe GoO 

Zi ^ Categorical(7r) 
~7V(^^,,cr^J 



(33) 
(34) 
(35) 



with the critical difference of replacing the hyperprior tt with the stick breaking process ( Sethuraman 
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71, 



1-P; 



71. 



+ 



Pi 




(i = 1 



£1 = 5 



Figure 12: Left: Stick-breaking process. At each iteration (starting from the top) a tt is broken off 
with relative length ~ Beta(l,a). Right: Histogram over different realizations of the stick-breaking 
process. As can be seen, higher values of hyperprior a lead to a more spread out distribution. Taken 
from Eric Sudderth's PhD thesis. 



19911: 



TT ^ StickBreaking(Q!) 



(36) 



The stick-breaking process is a realization of a Dirichlet process (DP). Specifically, tt — {tta;}^]^ is 
an infinite sequence of mixture weights derived from the following process: 



(3k ~ Beta(l, a) 



(37) 
(38) 



1=1 



with a > 0. See figure 12 for a visual explanation. 



The Chinese Restaurant Process (CRP) - named after the apparent infinite seating capacity in 
Chinese restaurants - allows for a more succinct model formulation. Consider that customers Zi are 
coming into the restaurant and are seated at table k with probability: 
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p{Zi ^ k\zi „_i,q;, A) = — 

n — 1 + a 

where k — 1 ... K is the table and Uk is the number of customers already sitting at table k (see 



figure 13 for an illustration). It can be seen that in the limit as iiT — >■ cx) this expression becomes: 



n — 1 + a 

Thus, as customers are social, the probability of seating customer Zi to table k is proportional the 
number of customers already sitting at that table. This desirable clustering property is also known as 
the "rich get richer" . 



Note that for an individual empty table k at which no customer has been seated (i.e. = 0) the 
probability of seating a new customer to that table goes to in the limit as if oo. However, at 
the same time the number of empty tables approaches infinity. Consider that we have so far seated L 
customers to tables and the set Q contains all empty tables such that there are |Q| = K — L empty 
tables in the restaurant. The probability of seating a customer Zi at an empty table becomes: 

p{z^ e Q|zi_.. „_i,a) = — — 

n — 1 + a 

As can be seen, the probability of starting a new table is proportional to the concentration param- 
eter a. Intuitively, large values of the dispersion parameter a lead to more clusters being used. 



Thus, while the Stick-Breaking process sampled mixture weights from which we had to infer cluster 
assignments, the CRP allows for direct sampling of cluster assignments. The resulting model can then 
be written as: 



^fc, fjfc ^ Go() 
zi,...,N ~ CRP(a) 



(39) 
(40) 
(41) 
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Figure 13: Illustration of the Chinese Restaurant Process. Customers are seated at tables with pa- 
rameters 9. The more customers are already seated at a table, the higher the probability that future 



customers are seated at the same table (i.e. clustering property). Taken from Gershman and Blei 
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Finally, in a hierarchical group model we would need to place the infinite mixture on the subject 
level rather than the observed data level: 

fik,<^k -GoO (42) 

- CRP(a) (43) 

9,^Nifi.^,al) (44) 

X,,, ~ F{ej) (45) 

See figure [M] for a graphical model description. 

Note that while the potential number of clusters is infinite, any realization of this process will 
always lead to a finite number of clusters as we always have finite amounts of data. However, this 
method allows the addition (or subtraction) of new clusters as new data becomes available. 
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